Abstract

This report summarizes work performed by Computer
Corporation of America under the ARPA Very Large Databases
program. The report discusses two separate and quite
disjointed tasks.

Section 1, written by Robert H. Dorin, describes the
Datacomputer Technology Transfer project (DTT). This
project was aimed at expansion of the Datacomputer user
community by increasing the awareness of this unique tool
in the government community. The plan for the project
involved a series of general publicity efforts,
presentations at potential user sites, and extensive
technical support of the beginning steps of new users.

The result of this project was a substantial increase in
Datacomputer utilization, to wit:

- 180% increase in total storage used;

- 181% increase in monthly processor utilization; and

- 171% increase in the number of active system users.

Section  1 of  this  report  discusses  the  activities 
performed  under  the  DTT  project  and  elaborates  on  its 
results. 

Section 2, written by Joanne Z. Sattley, describes the
Message  Archiving  and  Retrieval  System  (MARS)  project. 
The  MARS  effort  was  directed  toward  the  design  and 
implementation  of  a prototype  system  to  provide  economical 
storage  and  convenient  retrieval  of  Arpanet  mail.  The 
system  has  been  fully  implemented  and  is  operational  at 
CCA  on  an  experimental  basis.  This  report  describes  the 
MARS  prototype  as  well  as  concepts  for  future  extensions 
of  the  system. 




Very  Large  Databases:  FTR 


Table  of  Contents 


1. Datacomputer Technology Transfer Project
1.1 Introduction
1.1.1 The Datacomputer
1.1.2 Plan of the DTT Project
1.2 Project Activities
1.2.1 Initial Identification and Contact
1.2.1.1 Mailing to Arpanet Community
1.2.1.2 Presentations at Conferences
1.2.1.3 Articles in Trade Publications
1.2.2 Presentation of Datacomputer Technology
1.2.2.1 Datacomputer Presentations
1.2.2.2 Seminars
1.3 Datacomputer Usage
1.3.1 System Utilization
1.3.2 Short Lead Time Users
1.3.3 Long Lead Time Users
1.4 Summary

2. Message Archiving and Retrieval System
2.1 Introduction
2.2 MARS Project Overview
2.3 Archiving
2.4 System Description
2.4.1 The Programs
2.4.2 The Message Format
2.5 Future Directions
2.5.1 On Filing
2.5.2 On Retrieving
2.5.3 Security Issues

A. Seismic Symposium Paper
B. Computerworld Article
C. Datacomputer Presentation Text
D. Datacomputer Presentation Slides

References & Bibliography



Datacomputer  Technology  Transfer  Project  Section  1 


1.  Datacomputer  Technology  Transfer  Project 


1.1  Introduction 

This  section  describes  the  activities  and  results  of  the 
Datacomputer  Technology  Transfer  (DTT)  project  of  the  VLDB 
program.  The  goal  of  this  project  was  the  expansion  of 
the  user  community  of  the  Datacomputer,  a network  data 
utility  developed  by  Computer  Corporation  of  America 
(CCA), and sponsored by the Defense Advanced Research
Projects Agency under contracts MDA903-74-C-0225 and
MDA903-74-C-0227.

Section  1.2  of  the  report,  "Project  Activities",  describes 
the  tasks  performed  under  the  project.  Section  1.3 
discusses the key result of the project: a substantial
increase  in  the  level  of  system  utilization.  The  balance 
of  this  section  briefly  describes  the  Datacomputer  and 
discusses  the  plan  of  attack  for  the  DTT  project. 

1.1.1 The Datacomputer

The Datacomputer is a complete hardware/software data
handling system designed to support:

1. data sharing among heterogeneous computers in a
network  environment  and 


2.  economic  storage  through  use  of  a mass  storage 
system. 

The Datacomputer has been offering service to computers on
the Arpanet since April, 1973. The first complete,
working version of the system was available in August,
1975, and an Ampex Terabit Memory (TBM) system was
installed in mid-1976, making the Datacomputer the first
data management system to offer convenient, ready-to-use
access to a mass memory. Version 2 of the Datacomputer,
which included a TBM system with an on-line capacity of
200 billion bits (about 60 times that of the Version 1
system), has been available to Arpanet users since October,
1976. More details of the system configuration and its
extensive database management facilities are available in
working papers and technical reports from CCA.


1.1.2 Plan of the DTT Project


The intent of the DTT project was to expand the user
community of the Datacomputer by increasing the level of
awareness of the system's capabilities among potential
users. The plan for achieving this objective involved the
following three steps:

1. Identify and make initial contact with potential
users. This step involved the wide distribution of
capsule Datacomputer information, conference
presentations, and articles. The response to this
activity identified a significant number of user
prospects.

2. Present more detailed information and discuss
applications with identified potential users. This
activity was allocated much of the effort under the
DTT project. It involved site visits, overview
presentations, and day-long training seminars.

3. Support and advise new users. When organizations
identified through steps 1 and 2 begin to use the
Datacomputer, their efforts are supported by the
Datacomputer staff under a separate contract,
MDA903-77-C-0183.

1.2 Project Activities

This  subsection  describes  the  tasks  performed  under  the 
DTT  project.  It  deals  with  steps  1 and  2 of  the  project 
plan: 

1. identifying potential users and

2.  providing  them  with  detailed  system  information. 

1.2.1  Initial  Identification  and  Contact 

Many  of  the  users  in  the  Arpanet  community  were  either 
unaware  of  the  Datacomputer  and  its  availability,  or 
somewhat  misinformed  about  the  system.  Some  knew  of  it  as 
a storage  resource  but  did  not  know  about  the  system's 
extensive  software  facilities.  Others  were  aware  of  its 
existence  but  unaware  of  its  potential  availability  for 
their use. This step of the project was intended to
increase general awareness so that organizations with
potential applications would at least know enough to
inquire further.

The techniques used to achieve this broader awareness
were: a mass mailing (section 1.2.1.1), presentations at
conferences (section 1.2.1.2), and a trade publication
article (section 1.2.1.3).

1.2.1.1 Mailing to Arpanet Community

A letter  describing  the  Datacomputer  was  mailed,  along 
with  an  overview  article  [MARILL  and  STERN]  and  a reply 
card,  to  Arpanet  users  in  the  U.S.  Government, 
particularly  within  the  Department  of  Defense.  This 
mailing effort produced a considerable response, including
inquiries from the following organizations:

- Defense Communications Agency
- Air Force Systems Command (Andrews AFB)
- Defense Mapping Agency
- Rome Air Development Center
- National Library of Medicine
- National Bureau of Standards
- U.S. Army CERL
- Rock Island Arsenal
- Tooele Army Depot
- Department of Commerce
- Naval Coastal Systems Laboratory
- Naval Ship Engineering Center
- NASA Ames Research Center
- Air Force Logistics Command
- Naval Surface Weapons Center
- U.S. ARRCOM

Those who responded were sent additional documentation,
and were telephoned to discuss applications and arrange a
presentation.

1.2.1.2 Presentations at Conferences

CCA was invited to participate in the Third IEEE-CS
sponsored Workshop on Mass Storage Systems in Palo Alto,
California on April 5-6, 1977. A variation of the
standard Datacomputer presentation (see section 1.2.2.1)
was delivered in a session consisting of users of mass
memory devices. CCA also participated in a panel
discussion on the use of mass storage to support
distributed processing.

This conference yielded additional organizations with
potential Datacomputer applications, including the U.S.
Army Engineering Topographic Laboratory.

CCA was invited to submit a paper to the IEEE
International Symposium on Computer Aided Seismic Analysis
and Discrimination on its work in support of a large
seismic database. A paper entitled "Use of the
Datacomputer in the Vela Seismological Network" was
written, and presented to this conference of seismologists
on June 9, 1977 in Falmouth, Massachusetts. A copy of
this paper is enclosed as Appendix A.

This conference yielded two additional contacts: the
National Science Foundation and the U.S. Geological
Survey.



1.2.1.3 Articles in Trade Publications

A publicity  article  describing  the  Datacomputer  was 
written,  and  arrangements  were  made  for  it  to  appear  in 
the  Computerworld  weekly  newspaper.  A copy  of  this 
article  from  the  May  9,  1977  issue  is  included  as  Appendix 
B. Twenty organizations responded with requests for more
information. Discussions are underway to have a similar
article printed in the Government Data Systems
publication.

1.2.2  Presentation  of  Datacomputer  Technology 

Once  prospective  users  were  identified,  it  was  important 
to  make  them  aware  of  the  technology  of  the  Datacomputer, 
particularly  the  benefits  provided  by  the  system.  In 
addition  to  the  hundreds  of  copies  of  articles,  manuals, 
and  working  papers  which  were  requested  by  representatives 
from  government  and  industrial  organizations,  the  DTT 
project  sought  to  meet  in  person  in  order  to  discuss  the 
Datacomputer  and  its  potential  use.  To  prepare  for  these 
meetings, an overview presentation (40 minutes) and a
training seminar (1 day) were developed. These are
discussed below.

1.2.2.1 Datacomputer Presentations

A 40-minute presentation with color slides was prepared.
The  text  and  slides  of  this  talk  appear  in  Appendices  C 
and  D.  The  talk  was  designed  for  application  managers, 
programmers  and  researchers.  The  presentation  emphasized 
features  of  the  system  which  are  unique  and  which 
characterize  those  applications  most  likely  to  enjoy 
substantial  benefits  from  use  of  the  system. 

The Datacomputer talk was delivered at several government
installations:

- Defense Mapping Agency Topographic Center
- Defense Communications Engineering Center, DCA
- National Bureau of Standards
- Rome Air Development Center

as well as other organizations:

- University of Maryland, Department of Information Systems Management
- Rand Corporation
- Digital Equipment Corporation, Federal Systems Group

On  several  occasions,  visits  to  CCA  were  arranged  for 
contacts  either  local  to  or  passing  through  the  Boston 
area.  In  addition  to  the  Datacomputer  presentation,  a 
tour  of  the  system  installation  was  given. 
Representatives  who  visited  CCA  during  the  project  came 
from: 


- the Rand Corporation
- the Department of Transportation
- MIT Nuclear Engineering Department
- the Israeli Armed Forces
- The Analytical Sciences Corporation
- a research group from Technical University in Braunschweig, Germany


1.2.2.2 Seminars


Some  Datacomputer  prospects  were  familiar  with  the 
concepts  of  the  system,  and  wanted  to  learn  more  about  the 
system  and  its  use.  A full-day  training  seminar  was 
prepared  for  such  prospects,  and  two  seminars  were  held 
during  the  DTT  project.  They  are  described  below. 

During 1976, CCA had been in touch with several
laboratories of the Energy Research and Development
Administration (ERDA) concerning use of the Datacomputer. To
further these discussions, a full-day seminar was held at
CCA in Cambridge, Massachusetts on January 21, 1977.
Detailed information about the Datacomputer was presented
and specific database applications within ERDA were
discussed. Invitations were extended to several ERDA
labs, and representatives attended from:

- Lawrence Berkeley Laboratory
- Brookhaven National Laboratory
- Argonne National Laboratory
- MIT's Laboratory for Nuclear Science
- ERDA Headquarters in Washington, D.C.
- ERDA research group at UCLA
- ERDA research group at NYU


The agenda consisted of functional and architectural
overviews of the Datacomputer, a tour and demo at the CCA
installation, a tutorial on Datalanguage, and a discussion
of ERDA applications. One large Datacomputer application,
involving a weather database, was planned at this seminar,
and will be discussed in section 1.3.3.

Many Arpanet users are located in California, so another
full-day seminar was arranged on July 12, 1977 at the
Stanford Research Institute with the help of Ms.
Elizabeth Feinler of the Network Information Center. The
agenda  was  quite  similar  to  the  ERDA  seminar.  Several 
existing  Datacomputer  applications  were  discussed  rather 
than  specific  applications  of  those  in  attendance.  Time 
was  taken  after  the  seminar  and  during  the  breaks  to 
discuss  the  applications  of  some  members  of  the  audience. 

The  audience  of  45  included  representatives  from: 


- Stanford Research Institute

1.3 Datacomputer Usage

Overall system utilization increased significantly during
the  first  six  months  of  1977.  Much  of  this  increase 
resulted  from  the  greater  awareness  of  the  Datacomputer 
within  the  Arpanet  community. 

This  increased  utilization  has  occurred  within  the 
relatively  short  duration  of  the  DTT  project.  This  was 
possible  because  many  new  users  were  able  to  avail 
themselves  of  Datacomputer  software  which  was  already 
available  and  well  suited  to  their  task.  Others  required 
new  software  development.  These  latter  applications  are 
now in the development stage. While they did not have an
immediate impact on system usage, over the long run they
will  add  substantially  to  the  user  community. 

Following  a discussion  of  overall  system  usage,  examples 
of  short  and  long  term  applications  will  be  presented. 


1.3.1 System Utilization

Three  factors  were  chosen  to  measure  the  Datacomputer 
utilization. 

1.  Storage  - This  is  a significant  indicator  of 
utilization of a database system. Furthermore,
since  the  Datacomputer  is  a remote  facility,  the 
storage  utilization  represents  very  directly  a 
burden  which  has  been  removed  from  the  user's  local 
resources. 

2.  CPU  time  - Since  the  data  management  facilities  of 
the  Datacomputer  include  sequential  searching, 
indexed  access  and  computational  features, 
processor  utilization  is  a relevant  measure  of 
system  work. 

3.  Connect  time  - This  measure  represents  a rough 
gauge of user activity.

All  three  of  these  factors  showed  substantial  increases 
over  the  measured  period.  The  storage  utilization  in  the 
Datacomputer increased by 180% from 66 billion bits in
January to 185 billion bits in June. This is the
equivalent of approximately 230 3330-type disk packs. The
monthly storage charges are shown in Figure 1.1.

Figure 1.1: Storage Utilization

The central processing utilization increased by 181% from
51.9 CPU hours in January to 146.1 CPU hours in June, and
is shown in Figure 1.2.

Figure 1.2: Processor Utilization, 1977

The total connect time from remote users to the
Datacomputer increased by 160% from 1025 connect hours in
January to 2661 connect hours in June, having reached a
high-water mark of 2892 hours in May. This utilization
increase is shown in Figure 1.3.
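
The growth percentages quoted in this subsection can be checked with a few lines of arithmetic. The following Python sketch is an illustration added for the reader, not part of the original measurements. The function names are hypothetical, and the pack capacity (100 million 8-bit bytes per 3330-type pack) is an assumption, since the report does not state the capacity it used.

```python
# January and June 1977 figures as quoted in section 1.3.1.
USAGE = {
    "storage (billion bits)": (66.0, 185.0),
    "CPU time (hours)": (51.9, 146.1),
    "connect time (hours)": (1025.0, 2661.0),
}

# Assumption (not stated in the report): one 3330-type pack holds
# about 100 million 8-bit bytes, i.e. 0.8 billion bits.
BILLION_BITS_PER_PACK = 0.8


def percent_increase(start, end):
    """Return the growth from start to end as a percentage of start."""
    return (end - start) / start * 100.0


def packs_needed(billion_bits):
    """Return how many assumed-capacity packs hold the given volume."""
    return billion_bits / BILLION_BITS_PER_PACK


if __name__ == "__main__":
    for name, (january, june) in USAGE.items():
        print(f"{name}: {percent_increase(january, june):.1f}% increase")
    print(f"June storage: about {packs_needed(185.0):.0f} packs")
```

Under these assumptions the storage and connect-time figures come out near 180% and 160%, the CPU figure comes out near 181.5% (given in the text as 181%), and the June storage volume corresponds to roughly 230 packs.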

1.3.2 Short Lead Time Users

As  previously  mentioned,  the  increased  system  utilization 
was  a result  of  short  lead  time  applications  for  which 
software  was  immediately  available.  An  example  of  such  an 
application  is  the  Datacomputer  File  Transfer  Program 
(DFTP). 

The  DTT  project  was  successful  in  soliciting  users  who 
wanted to take advantage of the Datacomputer's economic
file  storage  facilities.  DFTP  is  a program  developed  by 
CCA  which  provides  PDP-10  and  Multics  users  with  a set  of 
simple,  terminal-oriented  commands  to  store  and  retrieve 
files  on  the  Datacomputer.  It  has  a ready-to-use 
interface,  and  no  programming  is  required  at  the  user  site 
(for  PDP-10  and  Multics  systems).  This  is  a big  factor  in 
the  short  term  success  of  DFTP. 


DFTP translates the user's commands into Datalanguage, the
interface language of the Datacomputer, and handles all
data transfers, network connections, and local file I/O
operations. DFTP notifies the user of the results of each
command.

Six new sites began using DFTP during the DTT project,
bringing the total number of user sites to 20. These new
DFTP users were:

- Rome Air Development Center (Griffiss AFB)
- Carnegie-Mellon University
- the Air Force Armament Development and Test Center (Eglin AFB)
- the Packet Satellite/Speech project at ARPA
- Air Force Avionics Laboratory (Wright-Patterson AFB)
- National Bureau of Standards

The number of distinct DFTP user sessions increased from
237 in January to 835 in June. This 252% increase in use
reflected the popularity of DFTP. The number of active
users jumped from 42 to 72 in this period. Though DFTP
was originally implemented for Tenex systems, it is now
operational on several other PDP-10 operating systems
including TOPS-10, TOPS-20, ITS and SAIL, as well as on
the Multics system.

During the DTT project, interest in a DFTP-like facility
was expressed by several Arpanet users running the Unix
system on PDP-11s. These included the UCLA-Security
research group, Lawrence Livermore Laboratory, Lincoln
Laboratories Applied Seismology Group and the Rand
Corporation. In conjunction with the Datacomputer
presentation at the Rand Corporation, a meeting was held
to arrange a joint effort to develop a DFTP-like facility
for Unix systems. This effort has begun, and the facility
is expected to be available to Unix users in the fourth
quarter, 1977.

1.3.3 Long Lead Time Users

Some  applications,  which  are  well-suited  for  the 
Datacomputer,  will  nonetheless  involve  a long  lead  time. 
Seismological  and  meteorological  databases,  which  have 
high  input  rates,  may  involve  the  gathering  and  subsequent 
analysis  of  data  at  dispersed  geographic  locations.  A 
data  utility  supporting  network  access  and  a mass  memory 
system  is  ideal  for  these  applications.  The  process  of 
planning  and  designing  such  an  application,  though,  is  a 
fairly  long  term  effort. 

During the fourth quarter, 1976, a group at the Division
of Energy Conservation at Argonne National Laboratory
(ANL) began to experiment with the Datacomputer. This
group was planning to build a prototype weather database
on the Datacomputer for use by the CAL-ERDA Building
Energy Analysis Project. CAL-ERDA is a family of computer
programs,  currently  under  development,  which  is  used  for 
analysis  of  energy  usage  in  buildings.  It  is  a 
cooperative  effort  of  ANL,  Lawrence  Berkeley  Laboratory 
(LBL),  Los  Alamos  Scientific  Laboratory  (LASL)  and  the 
U.S.  Army  Construction  Engineering  Research  Laboratory 
(CERL).  The  weather  data  to  be  stored  on  the  Datacomputer 
would  be  retrieved  by  several  laboratories. 

This  application  is  particularly  appropriate  for  the 
Datacomputer  because  it  involves  both  sharing  data  at 
widely dispersed sites and the storage of a very large
database.  However,  the  complexity  of  the  application 
leads  to  a longer  start-up  period  than  simple  DFTP-like 
uses.  CCA  has  been  actively  involved  in  this  start-up 
effort. 

CCA  arranged  for  the  group  from  ANL  to  meet  with  a 
representative from the National Oceanic and Atmospheric
Administration (NOAA) in conjunction with the ERDA seminar
in  January,  1977.  NOAA  has  access  to  a large  amount  of 
weather  and  solar  radiation  data  (totalling  more  than  10 
billion  bits)  from  the  National  Climatic  Center  in 
Asheville,  North  Carolina,  and  will  make  this  available 
for  the  CAL-ERDA  application.  Since  this  meeting,  CCA, 
ANL,  and  NOAA  have  been  working  on  the  design  of  this 


This  weather  database  effort  has  been  underway  for  several 
months,  and  continues  to  make  progress.  Such  applications 
will  bring  a steady  and  substantial  increase  in  system 
utilization  over  the  long  term. 

1.4 Summary


The goal of the Technology Transfer Project was to
increase the Datacomputer user community by presenting the
benefits available to prospects. The Datacomputer was
presented in several different forums, and a great deal of
interest was generated. Those users who indicated a
desire to use the system were given the assistance they
needed, and the overall system utilization increased
substantially. CCA will continue to maintain contact with
all of the active prospects identified by the DTT project,
and the system utilization should increase further as
planned projects begin implementation.

2. Message Archiving and Retrieval System


2.1  Introduction 


Under this task CCA designed and implemented a prototype
version  of  MARS  (Message  Archiving  and  Retrieval  System), 
a system  which  provides  economical  storage  and  convenient 
retrieval  of  Arpanet  messages. 

MARS  achieves  inexpensive  storage  through  use  of  the 
Datacomputer  [MARILL  and  STERN],  a network  database 
utility developed by CCA under ARPA contracts
MDA903-74-C-0225 and MDA903-74-C-0227. The Datacomputer
offers on-line storage at a cost which is two orders of
magnitude less than other on-line alternatives.


Messages  archived  using  MARS  are  heavily  indexed  and  can 
be  retrieved  in  a variety  of  ways  including  Boolean 
combinations  of  message  recipients,  message  date  and  time, 
any  text  words  in  the  message  subject,  and  text  words  in 
the  message  body.  The  MARS  facilities  are  integrated  very 
naturally  into  the  existing  collection  of  message  handling 
tools: 

- A message is designated for archiving by sending it
to the MARS-filer using one of the usual message
mailing tools such as SNDMSG.

- A message  is  designated  for  retrieval  by  sending  a 
request  as  ordinary  mail  to  the  MARS-RETRIEVER. 

The  prototype  MARS  has  been  fully  implemented  and  placed 
in  service  at  CCA.  It  has  proven  to  be  an  effective  and 
popular  tool  in  this  community  of  mail  users. 

In  this  section  we  discuss  both  the  current  implementation 
of  MARS  and  some  general  archiving  concepts  which  suggest 
extensions  of  MARS  for  the  future: 

Subsection 2.2 - "Project Overview" presents the general
goals of the implementation and its system
architecture.

Subsection 2.4 - "System Description" provides details of
the implemented prototype.

Subsection 2.5 - "Future Directions" discusses a
collection of improvements and extensions
appropriate for a full scale implementation
of MARS.

2.2 MARS Project Overview

Electronic  mail,  as  typified  by  the  Arpanet  mail  services, 
is  becoming  an  important  mode  of  inter-personal  and 
inter-organizational  communication.  In  the  Arpanet 
community,  thousands  of  individuals  and  hundreds  of 
organizations  use  these  services  routinely  for  the 
exchange  of  messages,  in  much  the  same  way  that  they  would 
have used telephones or paper mail in the past. In many
situations  the  electronic  mail  systems  are  preferable  to 
these  older  alternatives  because: 

1.  They  are  much  faster  than  paper  mail. 

2.  They  are  often  cheaper  than  telephones. 

3.  They  are  often  more  convenient  than  telephones 
because  they  do  not  require  that  both  parties  be 
available  simultaneously. 

4.  They  create  a (potentially)  permanent  written 
record  of  the  correspondence. 

The model of voluminous and enthusiastic use of electronic
mail  which  has  been  established  in  the  Arpanet  community 
seems  likely  to  be  repeated  in  both  civilian  and  military 
organizations  outside  of  Arpanet. 

If electronic mail is to fulfill its potential for impact
on  human  communication,  it  is  essential  that  effective 
tools  exist  for  handling  messages  throughout  their  entire 
life-cycle:  from  composition,  through  transmission, 
through  receipt  and  reading,  and  finally  to  archiving.  In 
the  Arpanet  mail  systems  today,  effective  tools  are 
available for composition, transmission, receipt and
reading.  However,  there  does  not  exist  a complete, 
effective  and  economical  mechanism  for  archiving  mail. 
The  MARS  project  has  focussed  on  the  implementation  of  a 
prototype  of  such  an  archiving  facility.  In  this  section 
we  present  the  design  goals  of  the  prototype  and  its 
general  architecture. 

The  MARS  prototype  was  designed  to  meet  the  following 
objectives: 

1. Storage economy - In a large community of
electronic  mail  users,  such  as  the  Arpanet 
community,  the  volume  of  a message  archive  can  grow 
very  large  indeed.  One  active  user  of  an  Arpanet 
mail  facility  whom  we  studied  has  received  or  sent 
25,000 bits of message traffic per day over the
last  two  years.  While  this  user  does  not  represent 
a random  sample  in  any  sense,  we  felt  that  his 
behavior  was  typical  of  active  users.  If  there 
were  1000  such  users,  11  IBM-3330  type  disk-packs 
would  be  required  to  hold  1 year  of  their  message 
traffic.  At  typical  on-line  storage  rates  this 
archiving  activity  would  cost  about  $350,000.  One 
clear  objective  of  a reasonable  archiving  facility 
is  to  lower  this  very  high  element  of  the  system 
cost. 

2. Retrieval quality - The raison d'être of a message
archiving facility is the requirement to retrieve
messages from time to time. Because the total
message archive will be very large and the time
frame over which messages are held will be fairly
long (e.g. several years), it will often be
difficult to find a particular message in the
archive. This will be especially true for the
frequent case in which the message can only be
vaguely identified, for example by a specification
like "A message sent to Walker early in 1977
regarding complete security." In today's military
message archiving schemes a message can only be
retrieved if the message identification number is
known.  Even  in  these  cases  the  retrieval  usually 
requires  days  to  complete.  This  kind  of  service 
exhibits  a low  retrieval  quality.  A key  design 
objective  for  MARS  was  to  achieve  flexible 
retrieval  requests  which  exhibit  good  precision  and 
recall  and  which  can  be  handled  in  a timely 
fashion. 

3.  Ease  of  use  - As  is  usual  for  systems  oriented 
toward  end  user  functions,  it  is  important  that 
MARS  be  easy  to  use.  Since  MARS  operates  in  an 
environment  where  there  are  already  well 
established  user  tools  and  procedures,  this 
requirement  translates  into  a need  for  a clear  and 
natural  integration  of  the  MARS  facilities  into  the 
existing  environment. 
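
The storage estimate in objective 1 can be reproduced with simple arithmetic. The sketch below is an illustration only. The function names are hypothetical, and the pack capacity (100 million 8-bit bytes per 3330-type pack) is an assumption that the report does not state.

```python
BITS_PER_USER_PER_DAY = 25_000  # traffic of the studied user
ACTIVE_USERS = 1_000            # hypothetical community size used in the text
DAYS_PER_YEAR = 365

# Assumption (not stated in the report): a 3330-type pack holds
# about 100 million 8-bit bytes, i.e. 800 million bits.
BITS_PER_PACK = 100_000_000 * 8


def yearly_archive_bits(users, bits_per_user_per_day, days=DAYS_PER_YEAR):
    """Total message traffic, in bits, archived over the given number of days."""
    return users * bits_per_user_per_day * days


def packs_required(total_bits, bits_per_pack=BITS_PER_PACK):
    """Number of packs (possibly fractional) needed to hold total_bits."""
    return total_bits / bits_per_pack


if __name__ == "__main__":
    total = yearly_archive_bits(ACTIVE_USERS, BITS_PER_USER_PER_DAY)
    print(f"{total:,} bits per year, about {packs_required(total):.1f} packs")
```

With these inputs the yearly volume is a little over 9 billion bits, or between 11 and 12 packs of the assumed capacity, which is consistent with the figure of 11 packs given above.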

The MARS prototype meets all of these objectives:

1. Storage economy is achieved through use of the
Datacomputer  [MARILL  and  STERN],  a network  database 
utility  developed  by  CCA  for  ARPA.  The 
Datacomputer  employs  a mass  memory  system  called 
the  Ampex  Tera-bit  Memory  (TBM)  which  can 
economically  store  millions  of  messages  per  year. 

The use of TBM Mass Memory for storage of messages
offers dramatic savings in storage cost compared to
alternative on-line devices. For example, a single
storage component of a TBM Mass Memory can store as
much data as 100 3330-type disks [GREEN],
[WILDMANN]. The TBM component cost is $100,000,
whereas the disk units would cost more than
$2,500,000. Indeed, TBM Mass Memory storage costs
are comparable to the cost of paper storage in
filing cabinets.

2. The MARS prototype supports the retrieval of
messages based on a variety of criteria including
Boolean combinations of message recipients, message
date and time, any text words in the message
subject and text words in the message body. It
will retrieve messages based on these criteria in a
matter of minutes.
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sending  it  as  a message  to  the  MARS-RETRIEVER. 
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With  this  understanding  of  the  MARS  objectives  and  the 
general  scheme  for  achieving  them,  we  are  in  a position  to 
consider  the  system  architecture  in  more  detail.  Figure 
2.1  illustrates  the  basic  MARS  architecture.  As  depicted 
here,  the  architecture  involves  four  computers 
communicating  over  the  Arpanet.  These  are: 

1. A User Host running a message system;

2. The Archiver;

3. The Retriever; and

4. The Datacomputer.

Any combination of 1, 2, and 3 can run on a single
machine.

The  MARS  architecture  will  be  described  here  by  tracing 
the  numbered  data  flow  arcs  in  Figure  2.1  beginning  with 
the  user  host.  Arcs  #1  - #2  constitute  the  archiving 
stage  of  MARS;  arcs  #3  - #6  constitute  the  retrieval 
stage. 
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Figure  2.1 
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2.3  Archiving 

Data flow arc #1: A user employing an ordinary message
system designates a message for archiving in one of two
ways:

1. When the message is being sent, the user may
include the addressee "Archiver@<archiver-site>" in
the TO: or CC: fields of the message, or

2. after the message has been received he may forward
it to the Archiver. (In the long run message
systems could be modified to do this
automatically.)

In either case, the result is that the message is sent to
the Archiver site and is stored in the MESSAGE.TXT file of
the Archiver directory. It is important to note that this
interface is compatible with all standard Arpanet message
systems and requires no reprogramming of existing message
systems.


Data flow arc #2: Periodically the Archiver program wakes
up and reads the messages which have been stored in its
MESSAGE.TXT file. It indexes each message on several
fields including: TO, FROM, CC, DATE, every word of
subject, and every word of text. Then it sends each
message along with the appropriate control information
(expressed in Datalanguage) to the Datacomputer for
permanent storage.

Figure 2.2  Retrieval Request

MARS-A Retrieval Request (in Arpanet Message Format):

TO: WALKER
FROM: ROTHNIE or MARILL
DATE: BEFORE MAY 1, 1977
TEXT: MARS or MESSAGE ARCHIVING , MEETING

Retrieval: The process of retrieving messages from the
Datacomputer involves four steps:

1. First a user specifies to the Retriever which
messages he desires;

2. This specification is translated into Datalanguage
and passed on to the Datacomputer;

3. The Datacomputer selects the specified messages and
returns them to the Retriever; and

4. The Retriever passes the messages on to the user.

These steps are described further below.

Data flow arc #3: A user specifies messages to be
retrieved by preparing a retrieval request (RR) and
sending it to the Retriever as a normal message. Two
example retrieval requests are shown in Figure 2.2. The
format of an RR mimics the format of the messages it is
trying to retrieve. The example shows an RR requesting
messages sent to Walker by Rothnie or Marill before May 1,
1977, in which the word groups

- MARS or
- message archiving

appear together with

meeting.

Within  this  framework,  MARS  will  permit  users  to  specify 
arbitrary  Boolean  combinations  of  indexed  fields  as  the 
basis  for  message  retrievals. 

Data flow arc #4: Periodically the Retriever wakes up and
reads its MESSAGE.TXT file. For each RR it finds there,
it formulates a Datalanguage request and sends it to the
Datacomputer to retrieve the specified messages. (This
process is subject to security controls which are built
into the Datacomputer.) The Retriever may send the
Datalanguage retrieval request to any Datacomputer
component and the Datacomputer itself will determine which
of its distributed modules contains the requested message.

The message-activated form of retrieving which is
described here is appropriate for most retrievals.
However, the complete form of MARS will permit interactive
access to the Retriever to accommodate human search
behavior and to provide more rapid access.


Data flow arc #5: The Datacomputer retrieves the
requested messages from all necessary distributed
components and returns them to the Retriever for temporary
buffering at the Retriever site.

Data flow arc #6: As the final step in retrieval, the
Retriever passes messages from its temporary buffer to a
locally resident message system (e.g., SNDMSG) and "mails"
them back to the requesting site.


In summary then: The MARS capabilities are conveniently
coupled to existing message systems. MARS receives
messages to be archived as ordinary network mail and
deposits them in its large storage resource, the
Datacomputer. Requests for message retrievals are
delivered as network mail and mimic the format of the
messages to be retrieved. The desired messages are
extracted from the Datacomputer and delivered as mail to
the requesting user.
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2.4  System  Description 

The  description  of  the  MARS  system  is  factored  here  into 
two  segments.  The  first  segment  is  a programmer’s  eye 
view  of  the  implementation.  The  second  segment  describes 
the  format  of  messages  in  the  MARS  database. 


2.4.1  The  Programs 

MARS is composed of three operationally independent
programs, a 'Filer', a 'Retriever' and a 'Retrieval
Request Preparer'. This segment describes the key
components of each program and the current status of their
development.

All  of  the  MARS  programs  are  written  in  the  BCPL  language 
except  for  the  pre-existing  MACRO-coded  lowest-level 
Datacomputer  input/output  interface  package. 

The Filer and Retriever programs, as implemented on
CCA-Tenex, are intended to be activated automatically and
periodically by a program demon.

The MARS Filer Program key components are: Initialize,
Get-Next-Message, Index-Message, Pack-Up-Message and
Transmit-To-Datacomputer.

The Initialize component is encoded in a way that
facilitates the compilation of both a "production" system
and a "debugging" version which accepts operating
instructions from an on-line terminal. (The technique
used relies upon the BCPL compile-time debugging switch.)
It is this component which houses most of the
operating-system-dependent functions.

While these functions typically concern themselves
with generating the programming environment, creating work
files, interfacing the program's input/output requirements
to the system's facilities, and the like, one specific MARS
problem was rather handily solved for us by the Tenex
system.
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In  order  to  allow  for  the  delivery  of  new  mail  during  the 
archiving process, we adopted the technique of renaming
the Tenex MESSAGE.TXT;1 file. This file has unique
Tenex-based attributes which prevent its deletion whether
deliberately or accidentally ordered. Thus, the renaming
operation, which would ordinarily result in the loss of
the original file name, here causes the simultaneous
creation of a fresh MESSAGE.TXT;1 file ready to receive
new mail.

The Get-Next-Message component is responsible for scanning
the ARCHIVE input file, nee MESSAGE.TXT;1, for isolating
individual messages, and for determining the message's
filing mode.
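
The renaming technique described above can be illustrated outside Tenex. The following is a minimal sketch in Python, not the BCPL implementation: the function name claim_mailbox and the default file names are assumptions made for illustration, and because an ordinary file system does not recreate a renamed mailbox the way Tenex does for MESSAGE.TXT;1, the sketch creates the fresh file explicitly.

```python
import os
import tempfile


def claim_mailbox(directory, mailbox="MESSAGE.TXT", claimed="ARCHIVE"):
    """Rename the mailbox so new mail can arrive while the old mail is filed.

    Returns the path of the renamed file. Tenex recreates MESSAGE.TXT;1
    by itself; here the empty replacement is created explicitly.
    """
    source = os.path.join(directory, mailbox)
    target = os.path.join(directory, claimed)
    os.rename(source, target)      # the Filer now owns the old contents
    open(source, "a").close()      # fresh, empty mailbox for new mail
    return target


def demo():
    """Show that the old mail moves and an empty mailbox takes its place."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "MESSAGE.TXT")
        with open(path, "w") as handle:
            handle.write("old mail\n")
        claimed = claim_mailbox(directory)
        with open(claimed) as handle:
            old_mail = handle.read()
        return old_mail, os.path.getsize(path)
```

Unlike the Tenex behavior described above, this sketch leaves a brief interval between the rename and the creation of the new file, so it illustrates the idea without reproducing the guarantee.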

There are three modes of filing currently supported in the
MARS prototype:

-- single-message mode, wherein the MARS-Filer mailbox
appears in the message as an addressee, but not
as the primary recipient;

-- forwarded-message mode, wherein the MARS-Filer
mailbox appears as the only primary recipient; and

-- batch mode, wherein the mailing envelope is
addressed to MARS-Filer and the subject-field
contains the keyword "batch".
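
The three filing modes can be told apart from the header fields alone. The sketch below is one possible reading in Python, offered as an illustration only: the function name filing_mode is hypothetical, and it assumes that "primary recipient" means the TO: field, that batch mode requires MARS-Filer to be the sole TO: addressee, and that mailbox names are compared without regard to case.

```python
FILER = "MARS-FILER"


def filing_mode(to, cc, subject, filer=FILER):
    """Classify a message by the three filing modes listed above.

    `to` and `cc` are lists of mailbox names. Returns "batch",
    "forwarded", "single", or None when the Filer is not addressed.
    """
    to_names = [name.upper() for name in to]
    cc_names = [name.upper() for name in cc]
    filer = filer.upper()
    if to_names == [filer]:
        # The Filer is the only primary recipient.
        if "batch" in subject.lower().split():
            return "batch"
        return "forwarded"
    if filer in to_names or filer in cc_names:
        # The Filer is one addressee among others.
        return "single"
    return None
```

For example, a message sent to WALKER with MARS-Filer in the CC: field is classified as "single", while a message addressed only to MARS-Filer with the subject "batch" is classified as "batch".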

Since  there  does  not  yet  exist  an  ARPANET  standard  for  the 
format  of  messages,  the  variability  amongst  formats  is 
still  greater  than  the  Filer  can  handle  as  it  stands. 
Nonetheless,  a user  can  successfully  file  any  message  in  a 
"foreign"  format  by  forwarding  it  to  the  Filer  under  the 
aegis  of  a mail-handling  program  which  does  produce  good 
formats.  Admittedly,  the  correct  header-field  indexing, 
as  described  below,  will  not  be  done  on  the  enclosed 
message;  but  at  least,  the  words  in  its  unreadable  header 
fields  will  appear  as  "text"  words  in  the  indexing. 


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In  the  case  of  single-message-mode  filing,  the  entire 
message  is  scanned  for  indexing  terms,  commencing  with  the 
DATE  field  of  the  message-header  and  terminated  either  by 
the  message's  byte  count  or  by  the  end-of-message 
indicator  supplied  by  the  message-composing  program.  (The 
Tenex  SNDMSG  program,  for  example,  standardly  appends 
seven  (7)  hyphens  followed  by  the  codes  for  a 
carriage-return  and  a line-feed  to  each  message  it 
generates.) 

In  the  case  of  forwarded-message-mode  filing,  all 
interesting  indexing  information  is  extracted  from  the 
message-header  prior  to  discarding  it.  The  name  of  the 
archiver,  the  date  and  time  the  message  was  forwarded,  and 
the  subject-line  information  are  recorded.  The  remainder 
is  handled  as  though  it  were  a non-forwarded  message  which 
had  been  CC’d  to  the  Filer. 

In  the  case  of  batch-mode  filing,  only  the  archiver's  name 
and  the  date  and  time  she/he  sent  the  package  are 
retained.  The  message-body  portion  is  treated  as  a series 
of  individual  messages. 

The  Index-Message  component  is  designed  to  be  capable  of 
isolating  every  parsable  token  in  a message.  Needless  to 
say,  the  adoption  of  standardized  message  formats  would 
facilitate  this  process. 


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The  Filer  "indexes",  in  effect,  on  everything  without 
analysis, except for the following:

-- Each distinguishable section of the message is
indexed separately; each header line is a
separate inversion domain, as is the body of
the message.

-- The header lines which contain ARPANET addresses
are analyzed in order to index separately on
mailbox and host.

-- The DATE: field is parsed and converted to a
numeric value, to allow retrievals of the form,
'BEFORE<date>', 'SINCE<date>', etc.
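
The value of converting the DATE: field to a number can be seen in a short sketch. The Python below is illustrative only: it uses the day ordinal from the standard datetime module as a stand-in for the Tenex internal date/time format, assumes day-month-year input with optional hyphens, and assumes that two-digit years belong to the 1900s. The names date_key, before and since are hypothetical.

```python
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}


def date_key(text):
    """Convert 'day-month-year' (hyphens optional) to a comparable integer."""
    day, month, year = text.replace("-", " ").split()
    year = int(year)
    if year < 100:
        year += 1900  # assumption: two-digit years belong to the 1900s
    return date(year, MONTHS[month[:3].upper()], int(day)).toordinal()


def before(message_date, limit):
    """True when the message was composed strictly before the limit date."""
    return date_key(message_date) < date_key(limit)


def since(message_date, limit):
    """True when the message was composed strictly after the limit date."""
    return date_key(message_date) > date_key(limit)
```

Once dates are integers, a date-range retrieval reduces to two numeric comparisons, which is the property the numeric conversion is meant to provide.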

A scanner control was designed to comply with the message
syntax standards proposed by ARPA's Committee on
Computer-Aided Human Communication (CAHCOM) in RFC724
(which, in turn, retains the minimum formatting
characteristics proposed in its predecessors RFC561 and
RFC680). One requirement is that the DATE field be
unique. We further require that it be the first
message-header item. Another proposed standard, which is
a MARS standard as well, is that the message-header fields
be separated from the message-text by a double
carriage-return/line-feed (i.e., a blank line).
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For  each  message,  a vector  of  parsed  tokens  is  created. 
The  parsed  tokens  are  collected  by  the  message-field  in 
which  they  occurred  - to  be  used  as  "indexes",  i.e., 
values  of  inverted  fields,  by  the  Datacomputer. 

These  tokens  are  packed  up,  along  with  the  message  as  it 
will  be  retrieved,  for  storage  on  the  Datacomputer. 

This  method  provides  all  the  flexibility  we  need  in  order 
to  scan  all  the  various  message-header  fields  that  current 
message-composing  programs  can  produce  and  to  decide  which 
of  them  may  subsequently  be  used  for  retrievals. 

The following list defines the message-header fields and
affiliated field-names which are identified as such by
Index-Message, and sketches their interpretation. The
terminology is adopted from RFC724.

<date-field>        The <date-time> information is
                    converted to the standard Tenex
                    internal date/time format, which
                    is better adapted for
                    less-than/greater-than
                    comparisons, as in retrievals
                    which specify a date range.

<originator>        The <mach-from-field> is the only
                    option adequately identified so
                    as to permit its selection by
                    <field-name>, i.e., "From", on
                    the Retrieval Request. The syntax
                    following the <field-name> is
                    limited to a <host-phrase> which
                    is defined to be an <atom>
                    followed by a <host-indicator>.
                    The mailbox scanner breaks down
                    the <host-phrase> so that
                    retrievals may be performed with
                    or without a <host-name>.

<addressee-field>   All of the <address-field>
                    <field-name>s are recognized.
                    The syntax of the field following
                    the field-name (To, cc, bcc, Fcc)
                    is limited to a <host-phrase>.

<extension-field>   We recognize the "Subject" and
                    "Message-Id" field-names and
                    handle them separately. The
                    other field-names ("In-Reply-To",
                    "Keywords", "References",
                    "Comments", <user-defined-field>)
                    are collected in a default bin of
                    two-element terms, the first of
                    which is the field-name, and the
                    second of which is the remainder
                    of the input line exactly as
                    scanned.

                    The "Subject" field merits
                    special attention because of its
                    generally-accepted purpose of
                    tersely conveying the essence of
                    a message. One-character words
                    are arbitrarily discarded though
                    they need not be; hyphenated
                    phrases, i.e., words bound
                    together by hyphens, are retained
                    intact. (There is even the
                    practice of sending "ethereal"
                    mail via bodyless messages! The
                    entire message is incorporated in
                    the "Subject" field.)

                    The "Message-Id" field also
                    receives special attention. In
                    fact, if a message is received
                    without one, a unique Message-Id
                    is created for it. It is the
                    contents of this field tested in
                    conjunction with the archiver's
                    name which would enable us to
                    isolate and eliminate duplicates
                    when retrieving.

                    The composition of the Message-Id
                    field is as follows:

                    <[From-Host]Sent-Date Time
                    From-Name>

<message-text>      The start of this field is
                    recognized by the blank line
                    which is expected to follow the
                    message-header portion of the
                    message, and it runs until the
                    message byte-count is exhausted.

                    "Words" and hyphenated phrases
                    are collected, eliminating
                    duplicates and "bubble" sorting
                    in the process so that the
                    resultant list reflects the
                    position of the original
                    occurrence of the term, modified
                    by its frequency of appearance in
                    the message.

                    Two-character words are discarded
                    though they need not be. And,
                    for now, only the first thousand
                    unique terms are processed. Of
                    these, only the first 127 (the
                    counting limit of a 7-bit byte)
                    are actually used for indexing.
                    Full-text indexing would not
                    require reprogramming here, but
                    see Pack-Up-Message constraints.
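
The treatment of <message-text> can be approximated in a few lines. The Python below is a sketch under stated assumptions rather than a transcription of Index-Message: terms are folded to upper case, words of fewer than three characters are dropped, and the result keeps simple order of first occurrence, since the frequency adjustment mentioned above is not specified in enough detail to reproduce. The name text_index_terms is hypothetical.

```python
import re

MAX_UNIQUE_TERMS = 1000   # "only the first thousand unique terms are processed"
MAX_INDEXED_TERMS = 127   # counting limit of a 7-bit byte

# A word, or several words bound together by hyphens.
WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def text_index_terms(body, min_length=3):
    """Return the unique terms of a message body in order of first occurrence."""
    seen = set()
    terms = []
    for match in WORD.finditer(body):
        term = match.group(0).upper()
        if len(term) < min_length or term in seen:
            continue
        seen.add(term)
        terms.append(term)
        if len(terms) == MAX_UNIQUE_TERMS:
            break
    # Only the first 127 terms are actually used for indexing.
    return terms[:MAX_INDEXED_TERMS]
```

Keeping the two limits as separate constants mirrors the text: the thousand-term cap bounds the scanning work, while the 127-term cap comes from the count field in the stored record.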

The Pack-Up-Message component resolves all conflicts
between the scanned message and the format of the data
expected to be transmitted to the Datacomputer.

The  data  sent  is  constrained  to  match  exactly  the  format 
specified  in  a Datalanguage  PORT  description.  This 
component  needs  "re-tuning"  whenever  FILE  and  PORT 
descriptions  are  modified.  The  Datalanguage  description 
of  a recent  MARS  model  file  is  appended  to  the  end  of  this 
section. 

The Transmit-To-Datacomputer component is the Datacomputer
interface. It contains the calls to open a file, send
'records' of information, and to close a MARS file during
the course of a filing session. Messages are sent one at
a time, after indexing.
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In  the  event  of  network,  Tenex  or  Datacomputer  failure, 
MARS  crashes;  it  is  impossible  to  continue  filing  once 
the  processing  has  been  interrupted.  We  do  provide  for 
generating  a back-up  disk  file  of  indexed  messages  and 
have  succeeded  in  accomplishing  delayed  (later)  filing  by 
direct  call  on  the  RDC  program. 

The MARS Retriever Program key components are:
Initialize, Get-Next-RR, Scan-RR, Perform-Retrieval and
Transmit-To-User.

The  Initialize  component  is  similar  to  the  Filer 
program's  Initialize;  it  contains  all  the  Tenex-dependent 
input/output  interface  routines. 

The  Get-Next-RR  component  is  comparable  to  the  Filer 
program's  Get-Next-Message;  a strictly-formatted  message 
is  expected. 

The  current  implementation  requires  that  MARS-Retriever 
appear  as  the  primary  recipient  of  the  message.  The  CC: 
and  SUBJECT:  fields  are  ignored. 

The Scan-RR component scans the message-body portion of
the message to construct a table of tokens which will be
used to compose the Datalanguage for performing the
retrieval.
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The  following  list  defines  the  message-header  field  names 
which  are  recognized,  and  some  notes  on  their 
interpretation.  The  scanning  of  each  field  is  terminated 
by  a carriage-return. 


DATE:       The format of the date field is
            day-month-year. Use of hyphens is
            optional. This field will cause only
            those messages composed on the specified
            date to be retrieved.

AFTER:      Use of this field will retrieve messages
            composed after the specified date.

SINCE:      This field is interpreted like the
            AFTER: field.

BEFORE:     Use of this field will retrieve messages
            composed before the specified date.

UNTIL:      This field is interpreted like the
            BEFORE: field.

FROM:       This field is expected to contain a
            valid mailbox name. The host
            specification is optional. If more than
            one name is specified, ORing of the
            names is implicit. Retrieval based upon
            host specification alone has not been
            implemented.

TO:         This field is expected to contain one or
            more valid mailbox names. The host
            specification is optional. Spaces and
            commas between the names imply AND.

SUBJECT:    Use of this field will retrieve all
            messages whose indexed subject-field
            contents match the specified word(s).
            Spaces and commas imply AND. The use of
            OR must be explicit.

TEXT:       Use of this field will retrieve all
            messages whose indexed message-body
            contents match the specified word(s).
            Spaces and commas imply AND. The use of
            OR must be explicit.
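
The rules for the SUBJECT: and TEXT: fields can be made concrete with a small parser. The Python sketch below is illustrative and is not the Scan-RR code. The report states only that spaces and commas imply AND and that OR must be explicit; the precedence used here (commas bind most loosely, then "or", then spaces) is an assumption chosen so that the TEXT: line of Figure 2.2 is read the way the accompanying text describes it. The names parse_terms and matches are hypothetical.

```python
import re


def parse_terms(expression):
    """Parse a SUBJECT: or TEXT: value into AND-of-OR-of-AND form.

    Result: a list of clauses (ANDed); each clause is a list of
    alternatives (ORed); each alternative is a list of words (ANDed).
    """
    clauses = []
    for clause in expression.split(","):
        parts = re.split(r"\s+or\s+", clause.strip(), flags=re.IGNORECASE)
        alternatives = [part.upper().split() for part in parts]
        alternatives = [alt for alt in alternatives if alt]
        if alternatives:
            clauses.append(alternatives)
    return clauses


def matches(clauses, indexed_words):
    """True when every clause has an alternative whose words are all indexed."""
    words = {word.upper() for word in indexed_words}
    return all(
        any(all(word in words for word in alt) for alt in clause)
        for clause in clauses
    )
```

Under these assumptions the Figure 2.2 value "MARS or MESSAGE ARCHIVING , MEETING" selects messages indexed under MEETING together with either MARS or both MESSAGE and ARCHIVING.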

The Perform-Retrieval component contains the Datacomputer
interface routines.

The Transmit-To-User component mails the retrieved
messages one at a time. They will appear as new mail
appended to the requester's MESSAGE.TXT file.
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The  MARS  Retrieval  Request  Preparer  is  an  interactive 
program  which  was  derived  straightforwardly  from  the 
Retriever  program. 

The  input  stream  scanner  was  lifted  from  the  Retriever's 
Get-Next-RR  component  and  a rudimentary  interactive 
controller  was  applied  to  organize  the  program  flow. 

The Retriever's Transmit-To-User component was adapted to
transmit the Retrieval Request message to MARS-Retriever
and to send a copy of it to the user (as more new mail).

Utility  print  routines  and  debugging  aids  were  included  in 
toto  and  used  as  the  basis  for  programming  the  Retrieval 
Request  formatter. 
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2.4.2  The  Message  Format 


Two  file  models  have  been  developed  for  filing  indexed 
messages,  the  major  difference  being  the  anticipated 
message  length.  Messages  of  less  than  1000  lines  use  one 
model;  longer  messages  (up  to  10,000  lines)  use  the  other. 
This  latter  definition  has  sufficed  to  file  all  of  the 
ARPANET  Special  Interest  Group  MsgGroup  correspondence,  a 
collection  which  is  well-noted  for  its  prolixity.  We  may, 
at  a later  time,  define  more  models  tailored  to  specific 
applications. 

The  Datalanguage  description  of  the  current  message  format 
is  given  below. 
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CREATE  IX  FILE  LIST  (,1000) 
  MESSAGE STRUCTURE
    DATE-FIELD STRUCTURE
      DATE BYTE, B=18
      TIME BYTE, B=18
      END /* DATE-FIELD */
    NET-INFO STRUCTURE
      DATE BYTE, B=18
      TIME BYTE, B=18
      END /* NET-INFO */
    MESSAGE-ID STRING (,79), C=1, I=D
    FROM-NAME STRING (,79), C=1, I=D
    FROM-HOST STRING (,79), C=1, I=D
    SUBJECT-LIST-COUNT BYTE, B=7
    SUBJECT-LIST LIST (,127), C=SUBJECT-LIST-COUNT
      SUBJECT-WORD STRING (,79), C=1, I=I
    ARCHIVER-NAME STRING (,79), C=1
    ARCHIVER-HOST STRING (,79), C=1
    RECIPIENTS-LIST-COUNT BYTE, B=7
    RECIPIENTS-LIST LIST (,127), C=RECIPIENTS-LIST-COUNT
      RECIPIENT STRUCTURE
        RECIPIENT-NAME STRING (,79), C=1, I=I
        RECIPIENT-HOST STRING (,79), C=1, I=I
        END /* RECIPIENT */
    KEYWORDS-LIST-COUNT BYTE, B=7
    KEYWORDS-LIST LIST (,127), C=KEYWORDS-LIST-COUNT
      KEYWORD STRING (,79), C=1
    TEXTWORDS-COUNT BYTE, B=7
    TEXTWORDS LIST (,127), C=TEXTWORDS-COUNT
      TEXTWORD STRING (,79), C=1, I=I
    MESSAGE-BODY LIST (,1000)
      LINE STRING (,511), D=31
    END; /* MESSAGE */
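
The size limits in the file description above are the constraints that Pack-Up-Message must respect. As an illustration only, the Python sketch below checks a record against a subset of those limits before transmission. The dictionary layout and the name record_problems are assumptions; the date, archiver and recipient fields are omitted for brevity.

```python
MAX_STRING = 79        # STRING (,79)
MAX_LIST = 127         # LIST (,127), counted by a 7-bit byte (B=7)
MAX_BODY_LINES = 1000  # MESSAGE-BODY LIST (,1000)
MAX_LINE = 511         # LINE STRING (,511)

STRING_FIELDS = ("MESSAGE-ID", "FROM-NAME", "FROM-HOST")
WORD_LISTS = ("SUBJECT-LIST", "KEYWORDS-LIST", "TEXTWORDS")


def record_problems(record):
    """Return a list of limit violations for a dict shaped like the description."""
    problems = []
    for name in STRING_FIELDS:
        if len(record.get(name, "")) > MAX_STRING:
            problems.append(name + " too long")
    for name in WORD_LISTS:
        words = record.get(name, [])
        if len(words) > MAX_LIST:
            problems.append(name + " has too many members")
        if any(len(word) > MAX_STRING for word in words):
            problems.append(name + " has an over-long member")
    body = record.get("MESSAGE-BODY", [])
    if len(body) > MAX_BODY_LINES:
        problems.append("MESSAGE-BODY has too many lines")
    if any(len(line) > MAX_LINE for line in body):
        problems.append("MESSAGE-BODY has an over-long line")
    return problems
```

An empty result means the record fits the description. A change to the FILE or PORT description would require changing the constants here, which is the kind of "re-tuning" the text attributes to Pack-Up-Message.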


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2.5  Future  Directions 

In  the  form  described  in  this  report,  MARS  already  offers 
a unique  though  rudimentary  service  to  ARPANET 
correspondents.  Its  adaptability  in  processing  various 
message  formats  while  keeping  pace  with  the  most  advanced 
offerings  in  the  field  augurs  well  for  the  life  expectancy 
of  the  system. 

This  part  of  the  report  explores  several  avenues  of 
potential  development  using  MARS  as  a base,  and 
consolidates  our  opinions  on  these  matters  as  best  we  can 
formulate  them  at  this  time. 

The ideas are gathered into three major parts: first,
filing enhancements; second, retrieval enhancements; and,
third, security issues.
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2.5.1  On  Filing 


Improved indexing would result from incorporating a
facility for using synonyms and root analysis in message
indexing. The indexing scheme of the Filer is intended to
maximize the likelihood that a message regarding a given
topic will be retrieved in response to a request from a
user interested in that topic. To accomplish this end,
the Filer must use more information than the exact text of
the message itself.

Specifically, MARS could be enhanced by using synonym
dictionaries so that, for example, "DOD" and "Department
of Defense" and "Defense Department" would be indexed the
same way.

The addition of root analysis would ensure that, for
example, "system" and "systems" are indexed identically.

The  most  frequent  words  in  message-texts  tend,  of  course, 
to  be  the  particles  — ’this’,  ’which’,  ’with’,  etc., 
which  are  unlikely  to  be  of  much  use  in  retrieval.  When 
we  undertake  the  preparation  of  pre-loaded  dictionaries 
for  synonyms  and  root  analysis,  we  shall  include  in  them  a 
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standard  list  of  proscribed  words  to  be  ignored  for 
indexing  purposes. 
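
The three indexing aids discussed above (synonym dictionaries, root analysis and a list of proscribed words) could be combined in a single normalization step. The Python below is a minimal sketch of that idea, not a design for the pre-loaded dictionaries: the tiny dictionaries, the canonical term DEPARTMENT-OF-DEFENSE and the plural-stripping rule are all illustrative assumptions.

```python
import re

SYNONYMS = {
    "DOD": "DEPARTMENT-OF-DEFENSE",
    "DEPARTMENT OF DEFENSE": "DEPARTMENT-OF-DEFENSE",
    "DEFENSE DEPARTMENT": "DEPARTMENT-OF-DEFENSE",
}
STOP_WORDS = {"THIS", "WHICH", "WITH"}


def root(word):
    """Crude root analysis: strip a plural 'S' (an illustrative rule only)."""
    if len(word) > 3 and word.endswith("S") and not word.endswith("SS"):
        return word[:-1]
    return word


def normalize(text):
    """Return index terms after synonym folding, stop-word removal and rooting."""
    text = text.upper()
    # Replace longer phrases first so that multi-word synonyms win.
    for phrase in sorted(SYNONYMS, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", SYNONYMS[phrase], text)
    terms = []
    for word in re.findall(r"[A-Z0-9]+(?:-[A-Z0-9]+)*", text):
        if word in STOP_WORDS:
            continue
        term = root(word)
        if term not in terms:
            terms.append(term)
    return terms
```

If the same normalization were applied to retrieval requests as to filed messages, a request for "DOD" would find a message that mentions only the "Defense Department".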

Paper-clipping  characterizes  a facility  which  would  enable 
a user  to  archive  a message  in  such  a way  that  it  would  be 
retrieved  as  part  of  a set  of  messages  with  which  it  might 
not  otherwise  have  been  associated. 

One  concept  explored  is  meant  to  resemble  that  of  a 
physical  paper  clip:  something  which  binds  messages 
together  so  that  the  retrieval  of  one  implies  the 
retrieval  of  all. 

Forward  chaining  is  another  linking  technique  worth 
considering.  If  a message  refers  to  an  earlier  message  in 
a manner  detectable  by  the  Filer  - e.g.,  in  a WITH 
REFERENCE  TO:  or  IN  REPLY  TO:  header  line,  or  in  an 
annotation  added  by  the  archiver  - then  a standard-form 
identification  of  the  earlier  message  would  be  placed  on  a 
"Refers  to"  indexing  (inverted)  list  in  the  current 
message  together  with  the  contents  of  the  Refers-to  list 
of  the  earlier  message  itself. 
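
Forward chaining as described above amounts to copying the earlier message's Refers-to list into the new message at filing time. The Python sketch below illustrates this with an in-memory dictionary standing in for the archive; the names file_message and stemming_from are hypothetical, and the handling of messages that cite several earlier messages is an assumption.

```python
def file_message(archive, message_id, refers_to=()):
    """File a message, storing a Refers-to list that includes inherited entries.

    `archive` maps message-id to its Refers-to list. Earlier messages are
    never modified; only the new message's list is written.
    """
    chain = []
    for earlier in refers_to:
        for entry in [earlier] + archive.get(earlier, []):
            if entry not in chain:
                chain.append(entry)
    archive[message_id] = chain
    return chain


def stemming_from(archive, message_id):
    """Return ids of all filed messages whose Refers-to list names message_id."""
    return [mid for mid, chain in archive.items() if message_id in chain]
```

Because each new message carries the full chain, every later message in a thread can be found by a single lookup on the inverted Refers-to list, without updating any message already on file.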

Later, when someone comes across an important message,
s/he can, with a simple additional request, retrieve all
later messages stemming from the earlier one. (For
reasons of documentary integrity, the MARS files have been
designed so that a message, once filed, cannot easily be
modified; hence the original message cannot conveniently
be updated to point to later messages which refer to it.)


2.5.2  On  Retrieving 


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Interactive  retrieval  in  the  MARS  context  connotes  a 
facility  for  composing  retrieval  requests  and  retrieving 
messages  on-line.  While  most  retrievals  are  expected  to 
be  performed  off-line,  it  would  occasionally  be  useful  to 
perform  this  operation  interactively. 

Other techniques which were considered, and might still be
candidates for future development, are:

-- collecting all the retrieved messages into
a single file, readable by mail-handling
programs, and then FTPing that file directly
into the requester's directory; but this
requires having access (password or other
privilege) to that directory on the foreign
host.

-- collecting the retrieved messages as above,
and mailing the file as a single long message;
the disadvantage to this is that many mail-handling
programs are unable to "go down a level" and
explore the structure of a "message" which is
really a file of messages. The user would be
unable to survey headers, etc., of the contained
messages.

-- collecting the retrieved messages into a single
file as above, but then placing that file on the
Datacomputer with DFTP, and sending the requester
just a short message telling him where to find
it; this is quite feasible and may yet be done.
It does require that the user be on a host which
is equipped with a DFTP program, and that s/he be
able and willing to use it.

Tolerant retrieval implies the existence of a facility for
dealing with common errors so that the value of a
retrieval request is seldom completely lost due to user
error. Such a capability is important since most
retrieval takes place in an off-line fashion with the user
unavailable to correct an error quickly.

If the system is too unforgiving, a frustrating sequence
of unsuccessful requests could easily ensue. Tolerance
would involve actions such as:

-- correcting common misspellings in headers;

-- limiting the size of excessive retrievals
by retrieving only message headers or
truncated messages; and

-- automatically weakening the restrictiveness
of a retrieval request if no messages are found.
(This could involve, for example, dropping
conjunctive clauses or expanding a range
of dates.)
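
The last of these actions, weakening a request that finds nothing, can be sketched as a simple loop. The Python below is an illustration under stated assumptions: the choice to drop the last conjunctive clause first is arbitrary, the search callable stands in for the Datacomputer retrieval, and the names retrieve_with_relaxation and make_search are hypothetical.

```python
def retrieve_with_relaxation(search, clauses):
    """Run search(clauses); if nothing is found, drop clauses from the end.

    `search` is any callable taking a list of conjunctive clauses and
    returning a list of messages. Returns (messages, clauses_actually_used).
    """
    remaining = list(clauses)
    while True:
        found = search(remaining)
        if found or len(remaining) <= 1:
            return found, remaining
        remaining = remaining[:-1]  # weaken: drop the last conjunctive clause


def make_search(messages):
    """Build a toy search over messages represented as sets of index words."""
    def search(clauses):
        return [m for m in messages if all(c in m for c in clauses)]
    return search
```

Returning the clauses actually used lets the Retriever tell the requester how the request was weakened, which matters when the user is not on-line to notice the change.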

Flexible  retrieval  denotes  a facility  for  retrieving  just 
portions  of  messages  rather  than  the  complete  text.  This 
would  include  surveying  header  information  or  simply 
counting  the  records  satisfying  some  condition. 

Automatic  retrieval  might  be  of  greater  interest  in  a more 
strictly  commercial  environment.  It  entails  a facility 
for  generating  a retrieval  request  on  behalf  of  a user  in 
response  to  some  event. 

This mechanism could be used in conjunction with a
"tickler file" to remind a user of something on a given
date. For every user employing this feature the Retriever
would prepare a Retrieval Request each day and retrieve
those messages stored in the tickler file indexed under
that date.
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Specification  of  retrieval  authority  involves  providing  a 
facility  by  which  the  user  who  archives  a message  may 
specify  who  may  retrieve  it.  Default  authority  would  be 
granted  to  all  users  listed  in  the  TO:,  CC:,  or  FROM: 
fields. 

Variations on default authority may be specified by the
user who archives a message through the use of a
RETR(IEVERS): special-handling field. The specifications
in this field are sequences of three types of terms:

-- user names;

-- distribution list names; and

-- distribution list names followed
   by an EXCEPT clause.

An EXCEPT clause could be employed to specify names to be
deleted from the distribution list.
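
The three kinds of terms could be expanded into a set of authorized
retrievers roughly as shown below. This is a sketch, not the MARS
design: the representation of an EXCEPT clause as a (list name,
excluded names) pair is an assumption, and whether the expanded set
replaces or extends the default authority is left open because the
report does not say.

```python
def default_retrievers(message):
    """Everyone named in the TO:, CC:, or FROM: fields."""
    return set(message["TO"]) | set(message["CC"]) | {message["FROM"]}


def expand_retrievers(terms, distribution_lists):
    """Expand a sequence of RETR(IEVERS) terms into a set of user names.

    Each term is a user name, a distribution list name, or a
    (list name, excluded names) pair standing for an EXCEPT clause.
    """
    allowed = set()
    for term in terms:
        if isinstance(term, tuple):
            list_name, excluded = term
            allowed |= set(distribution_lists[list_name]) - set(excluded)
        elif term in distribution_lists:
            allowed |= set(distribution_lists[term])
        else:
            allowed.add(term)
    return allowed
```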

Audit trails would provide a facility for recording all
retrievers of designated messages. If the
special-handling-designator AUDIT were specified for a
message, MARS would keep a record of each instance of
retrieval of that message. The record would include the
identity of the retriever, ARPANET host and socket number,
and date of retrieval. The audit trail is intended to aid
in the analysis of potential security breaches and as a
deterrent to unauthorized access.
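
The audit record described above can be pictured as a small data
structure. The following sketch is hypothetical throughout: the
field names, the message dictionary layout, and the in-memory list
are assumptions chosen only to mirror the items the text lists.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AuditRecord:
    message_id: str
    retriever: str      # identity of the retriever
    host: str           # ARPANET host
    socket: int         # socket number
    retrieved_on: date  # date of retrieval


class AuditTrail:
    def __init__(self):
        self.records = []

    def note_retrieval(self, message, retriever, host, socket, when):
        # Only messages archived with the AUDIT designator are tracked.
        if "AUDIT" in message.get("special_handling", ()):
            self.records.append(
                AuditRecord(message["id"], retriever, host, socket, when))

    def retrievers_of(self, message_id):
        return [r.retriever for r in self.records
                if r.message_id == message_id]
```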

Record traffic issues require a facility for designating
an archived message as an "official" copy of a sent
message. This capability requires that MARS be able to
justify substantial user confidence that the message in
the archives is in fact a true copy of the originally sent
message with the same message ID. This could be
accomplished with relatively great promise of success in
several ways.

Basically, a message would only be archived as "record
traffic" if the sender of the message were the same as the
user who is archiving the message, and the message were
being sent to MARS-Filer as part of the original TO: or
CC: list. This would ensure that the archived message is
a true copy unless the mailer program was somehow
subverted.
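
The two conditions above amount to a simple check. The sketch below
assumes a message is a dictionary with TO, CC, and FROM entries,
which is an illustrative layout rather than the MARS message format.

```python
def qualifies_as_record_traffic(message, archiving_user, filer="MARS-Filer"):
    """True if the sender is the archiver and MARS-Filer was an
    original TO: or CC: addressee."""
    addressed = set(message["TO"]) | set(message["CC"])
    return message["FROM"] == archiving_user and filer in addressed
```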

A second  means  of  ensuring  true  copies  is  a feedback  loop 
in  which  the  message  received  by  MARS-Filer  is  sent  to 
some  human  authority  for  approval.  This  authority  may  be 




USE OF THE DATACOMPUTER IN THE VELA SEISMOLOGICAL NETWORK

Robert H. Dorin and Donald E. Eastlake III
Computer  Corporation  of  America 
575  Technology  Square 
Cambridge,  Massachusetts  02139 


The Datacomputer is a very large capacity, centralized
database management facility for distributed networks.
It was developed for the Advanced Research Projects
Agency (ARPA) by CCA, and is currently available on the
Arpanet. The Datacomputer is the primary storage and
retrieval resource for the Vela Seismological Network.
Several different types of seismic data are stored on the
Datacomputer. The size of the database is growing at a
rate of about 30 billion bits per month. File
organizations attempt to make the most significant data
rapidly available to seismologists on Velanet. Special
communications protocols were designed and a dedicated
processor employed to accommodate high bandwidth,
real-time seismic array data.

1.  Introduction 

The Datacomputer is a network data utility developed by
CCA and designed to handle large files and communicate
with multiple remote programs on the Arpanet. This system
is the result of a research and development effort
sponsored by the Advanced Research Projects Agency (ARPA)
of the Department of Defense. The Nuclear Monitoring
Research Office (NMRO) of ARPA has established a seismic
data network, and the Datacomputer is the primary storage
and retrieval resource being utilized by the effort.

This seismic data activity involves the collection,
storage and processing of seismic waveform information
(seismograms) as measured by seismometers installed
throughout the world. The data will assist seismologists
in exploring techniques for detecting seismic events,
pinpointing their location, and recognizing the causes of
these events. A major application of the work is the
detection of underground nuclear tests in preparation for
future Strategic Arms Limitation Treaties. By
establishing an on-line, real-time database of seismic
information from a world-wide network of monitoring
sites, a great deal of data can be made easily available
to computers in the network for seismic analysis and
other purposes.

This paper describes the storage and retrieval of data in
support of this objective. Section 2 outlines the
communications facilities which support the collection
and distribution of the seismic data. Section 3 describes
the storage facility, the Datacomputer. In Section 4, the
nature of the data stored in the system is explained,
while in Section 5, its physical organization is
described. Finally, Section 6 considers certain detailed
communications issues which have arisen in handling this
application.

2. The Vela Seismological Network

The computers at various sites involved in the gathering
and subsequent analysis of seismic data are known as the
Vela Seismological Network or Velanet. Since many of the
computers are on the Arpanet, the Arpanet was chosen as
the most appropriate communications medium available for
the entire system. The Arpanet is a geographically
distributed computer communications network which was
designed to provide for the sharing of data and computing
resources among its users. Currently there are over 100
computers (called "hosts") connected to Arpanet, as shown
in Figure 1.

Arpanet utilizes a technique called "packet switching" to
transmit digital information from host to host. Packet
switching is a communications discipline in which data is
divided into small (usually one to two thousand bit)
segments called packets. Each packet is then augmented
with additional information including its destination,
sequencing, and an error checking code. The packet is
then launched into a network connecting the communicating
points. The network is composed of communications lines
and active computer "switching" nodes. The switching
nodes check for correct transmission, and will retransmit
a packet if transmission errors are detected. The nodes
dynamically determine the best routing in order to avoid
congestion and lines which are out of service. The
transmission is nearly as rapid as a dedicated channel
but with superior error characteristics.
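
The division of data into packets carrying a destination, a sequence
number, and an error checking code can be illustrated with a short
sketch. It is not the Arpanet packet format: the 125-byte (1000-bit)
payload, the dictionary layout, and the use of CRC-32 from zlib as
the checking code are assumptions made for the example.

```python
import zlib


def packetize(data: bytes, destination: str, payload_bytes: int = 125):
    """Split data into packets of at most payload_bytes each."""
    packets = []
    for seq, start in enumerate(range(0, len(data), payload_bytes)):
        payload = data[start:start + payload_bytes]
        packets.append({
            "dest": destination,            # where the packet is going
            "seq": seq,                     # position in the original data
            "payload": payload,
            "check": zlib.crc32(payload),   # error checking code
        })
    return packets


def reassemble(packets):
    """Verify every packet and rebuild the data in sequence order."""
    ordered = sorted(packets, key=lambda p: p["seq"])
    for p in ordered:
        if zlib.crc32(p["payload"]) != p["check"]:
            raise ValueError(f"packet {p['seq']} is corrupt; retransmit")
    return b"".join(p["payload"] for p in ordered)
```

Because each packet carries its own sequence number, packets that
take different routes and arrive out of order can still be
reassembled correctly.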

The structure of the Velanet appears in Figure 2. The
Velanet consists of two sites sending seismic waveform
information in real-time, the Large Aperture Seismic
Array (LASA) in Montana, and the Norwegian Seismic Array
(NORSAR). LASA data is transmitted via leased telephone
lines to an intermediate processor at the Seismic Data
Analysis Center (SDAC) in Alexandria, Virginia. NORSAR
data arrives at SDAC via a satellite communications link
of Arpanet.

Data which is not transmitted in real-time arrives at
SDAC on magnetic tapes from the Iranian Long Period Array
(ILPA), as well as from other instrument clusters
throughout the world. Non-array seismic data is sent by
magnetic tape from various locations around the world to
the Albuquerque Seismological Laboratory (ASL) of the
U.S. Geological Survey. Both the real-time and
non-real-time data arriving at SDAC, as well as the data
concentrated at ASL, are forwarded through the Arpanet to
CCA. The current traffic of seismic data to the
Datacomputer is 12 kilobits per second around the clock,
storing approximately 30 billion bits per month. There
are plans for additional sites which may boost the
traffic volume up to 35 kilobits per second.
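
The relation between the sustained traffic rate and the monthly
storage figure can be checked with a few lines of arithmetic. The
sketch assumes that a kilobit is 1000 bits and that a month has 30
days; neither convention is stated in the paper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def bits_per_month(bits_per_second, days=30):
    """Bits accumulated by a constant stream over a month."""
    return bits_per_second * SECONDS_PER_DAY * days


current = bits_per_month(12_000)   # 31_104_000_000, about 30 billion bits
planned = bits_per_month(35_000)   # 90_720_000_000, about 90 billion bits
```

Under these assumptions the present rate gives roughly 31 billion
bits per month, in line with the figure quoted above, and the
planned rate would roughly triple it.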

Processors throughout the Velanet can retrieve the
seismic data. Processors at SDAC, Lincoln Labs Applied
Seismology Group (LL-ASG) and possibly elsewhere will be
used by seismologists for this purpose. The nature of the
data in the files stored on the Datacomputer is discussed
in Section 4.

3.  The  Datacomputer 

The Datacomputer is a very large scale data storage
utility with substantial data management capabilities
[4]. Its design is heavily influenced by its use as a
resource in a network and its support of a trillion-bit
mass memory device. Though database management systems
have been in wide use in the past decade, the network
orientation and built-in interface to a mass memory make
the Datacomputer a unique resource.

By providing service as a dedicated, special-purpose
database "machine" in a network, the Datacomputer is
sharable by all computers having access to the network.
User programs running on network hosts communicate with
the Datacomputer across the Arpanet. These programs may,
in turn, communicate with terminal users, as shown in
Figure 3. Of particular importance to the Velanet
computers is the support of data sharing among dissimilar
machines. The Datacomputer software includes data
conversion facilities, and will perform translation
between various hardware representations and data
structuring concepts.


Figure 3. Datacomputer User's View


The architecture of the system is shown in Figure 4. The
system processor is a DEC System-10 (PDP-10), running the
Tenex operating system. The Datacomputer is interfaced to
the Arpanet Interface Message Processor (IMP), which in
turn interfaces to two 50 kilobit/second telephone lines
into the network.

Figure 4.


One of the requirements of the seismic data analysis
project is very large on-line storage capacity. This
requirement was met by the acquisition and integration of
an Ampex Terabit Memory System (TBM), the first public
installation of this device. The TBM has a maximum
on-line capacity of 3.2 trillion bits. A significant
advantage of this technology is an extremely low per-bit
cost, about $1/megabit. This is on the order of 20 times
cheaper than other on-line alternatives. The mass storage
devices currently on the market all have high price tags,
with configurations ranging from several hundred thousand
to several million dollars, but because of their enormous
capacities, their cost per bit is low.

Access time to the TBM is substantially slower than disk
storage. The average access to any portion of the three
trillion bit mass memory is 15 seconds. In order to
improve the effective speed of the device, the
Datacomputer software includes extensive facilities to
"stage" data from tertiary TBM storage to secondary disk
storage. These routines attempt to minimize the overhead
involved in using a relatively slow access memory device.
When data which has not already been staged is referenced
by a user's request, a portion of the database
surrounding the referenced data is staged to the faster
disks. Some files are staged in their entirety, while
others are staged in extents whose size may vary among
files. Once staged, data will remain on disk for repeated
accesses, from one or more users, until the staging area
becomes overcrowded. The preferential de-staging of least
recently used and of unmodified data is part of a
strategy to optimize use of the staging area.
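
The staging policy described above resembles a cache with a
least-recently-used replacement rule that favors unmodified data.
The sketch below is one possible reading, not the Datacomputer's
algorithm: the class, the fixed capacity counted in extents, and the
choice to always evict an unmodified extent before a modified one
are assumptions.

```python
from collections import OrderedDict


class StagingArea:
    """Hypothetical disk staging area holding a fixed number of extents."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.extents = OrderedDict()   # extent id -> modified?, oldest first
        self.tbm_fetches = 0           # slow accesses to the mass memory
        self.write_backs = 0           # modified extents copied back

    def reference(self, extent, modify=False):
        if extent in self.extents:
            self.extents.move_to_end(extent)     # now most recently used
        else:
            self.tbm_fetches += 1
            if len(self.extents) >= self.capacity:
                self._destage()
            self.extents[extent] = False
        if modify:
            self.extents[extent] = True

    def _destage(self):
        # Prefer the least recently used unmodified extent; if every
        # extent is modified, fall back to the least recently used one.
        clean = (e for e, dirty in self.extents.items() if not dirty)
        victim = next(clean, next(iter(self.extents)))
        if self.extents.pop(victim):
            self.write_backs += 1
```

Evicting unmodified extents first avoids a write back to the slow
device, which is presumably why unmodified data is preferred.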

The Datacomputer offers a complete set of database
management facilities for retrieval and maintenance, data
description, and access control. Since the files stored
on Velanet are rather large (single files as large as
4 billion bits), the capability to select small portions
of the database efficiently is very significant. The use
of multiple, user-selected inversions provides rapid
access to qualified records based on their content. These
inversion tables are maintained by the system, and the
user need not be aware of their existence. Such efficient
retrieval mechanisms will assist seismologists in
selecting seismic data containing certain attributes,
such as location and magnitude, and then verifying
theories about the waveform signals associated with those
attributes.
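
An inversion is, in modern terms, an index from field values to the
records that contain them. The following sketch shows the idea with
hypothetical names and record layout; it does not reproduce the
Datacomputer's inversion tables or Datalanguage.

```python
from collections import defaultdict


class InvertedFile:
    """Records plus system-maintained inversions on selected fields."""

    def __init__(self, inverted_fields):
        self.records = []
        self.inversions = {f: defaultdict(set) for f in inverted_fields}

    def append(self, record):
        rid = len(self.records)
        self.records.append(record)
        # The inversion tables are updated without user involvement.
        for field, table in self.inversions.items():
            table[record[field]].add(rid)
        return rid

    def select(self, **conditions):
        ids, rest = None, {}
        for field, value in conditions.items():
            if field in self.inversions:
                hits = self.inversions[field].get(value, set())
                ids = hits if ids is None else ids & hits
            else:
                rest[field] = value        # must be checked record by record
        candidates = range(len(self.records)) if ids is None else sorted(ids)
        return [self.records[i] for i in candidates
                if all(self.records[i][f] == v for f, v in rest.items())]
```

A query on an inverted field touches only the qualifying records,
while a query on any other field still works but scans the file.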

All the data management facilities of the Datacomputer
are handled through a uniform, high-level notation called
Datalanguage. Datalanguage is the communications vehicle
between the Datacomputer and the remote processors using
this data utility on the network. Further details
concerning these database facilities are described in
working papers and technical reports [1, 2].

4.  Seismic  Data  Files 

There are two basic categories of files stored on the
Datacomputer for the seismic project. First, the raw data
and status files provide complete seismic readings at
various instruments and arrays around the world, as well
as associated information on the status of these
instruments so that the raw readings can be properly
interpreted. Second, there are the derived event summary
and associated seismic waveform files. These consist of a
processed distillation of the first category into likely
seismic events, and the time segments of raw data that
are associated with these events. The derivation of the
second category of files from the first is performed by
processors at the Seismic Data Analysis Center in
Alexandria, Virginia.

Raw data files are organized by: type of sensor (array or
non-array), frequency of measurement (long and short
period), location of sensor, and time interval (day or
month of measurement).

Array data originates from the Large Aperture Seismic
Array (LASA) in Montana, the Norwegian Seismic Array
(NORSAR), and the Iranian Long Period Array (ILPA). LASA
and NORSAR are real-time on-line sites. The data is sent
through the Arpanet to a central collection point at
SDAC. A processor there re-transmits the data to CCA.
ILPA data is sent to SDAC on magnetic tape, and then
stored into the Datacomputer via the Arpanet. Each of
these arrays performs measurements at two frequencies:
long period (1 sample per second) and short period (10-20
samples per second). Long period data is stored in
monthly files (files containing all observations for a
month) while short period data is stored in daily files.

Non-array data is collected from single instruments at
Seismic Research Observatories (SRO's) around the world
and recorded on magnetic tape. It is then gathered at the
Albuquerque Seismological Laboratory (ASL) and forwarded
to the Datacomputer over the Arpanet. This data is also
divided into long and short period files having one and
twenty samples per second respectively. All non-array
(SRO) data is stored in multi-site day files. This
organization makes it most convenient to locate all of
the data related to a single seismic event.

Most of the status information stored in the Datacomputer
concerns real-time changes in the status of instruments.
A single stream of monthly status files is maintained for
the on-line LASA and NORSAR arrays. This file will
include status for the instruments of real-time arrays to
be added in the future, as well as real-time status
entries concerning the communications lines and
processors involved in the on-line network. Separate
status streams are maintained for the ILPA non-real-time
array instruments and for the SRO sites.

Each entry in the status files includes the time of the
entry, information on the station and instrument
involved, and data relating to the status of the
instrument. This data includes indications of
communications errors, data being absent, data marked
invalid by operator, or calibration in progress. There is
also a flag to indicate that the status entry is the
result of a change in status or the result of a dump of
status for many instruments regardless of change. Status
dumps currently exist only for the on-line instruments,
and are automatically produced once a day and after
communications disruptions in order to recover from any
change in status that might have been lost.
Determination of the status of an instrument at a
particular time requires searching the Datacomputer files
for the last entry concerning the instrument before that
time. The daily status dumps into the status file limit
the interval needed to be searched. As mentioned above,
entries are also made for the status of the processors
involved in the on-line Velanet.
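
The status determination described above is a backward search from
the time of interest. The sketch below assumes the entries are held
in a chronologically ordered list of dictionaries with hypothetical
keys; in the real system the search would be a Datalanguage request
against the status files.

```python
from bisect import bisect_left


def status_at(entries, instrument, when):
    """Return (status, from_dump) for the last entry before `when`.

    `entries` must be sorted by time. Each entry has the keys
    time, instrument, status, and dump (True for a status dump).
    """
    times = [e["time"] for e in entries]
    # Index of the first entry at or after `when`; search backward from
    # the entry just before it.
    for i in range(bisect_left(times, when) - 1, -1, -1):
        entry = entries[i]
        if entry["instrument"] == instrument:
            return entry["status"], entry["dump"]
    return None
```

Because a dump records every on-line instrument once a day, the
backward scan for such an instrument ends at the most recent dump at
the latest, which is the bound the text refers to.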

The culmination of the raw data and status files is the
event summary files and their parallel seismic waveform
files. These each come in preliminary and final forms.

The event summary files are, at the highest level, a list
of events. Each event has a unique event number, assigned
by SDAC, and information about the location, depth, and
magnitude of the event, as well as further information on
the source and confidence level of these conclusions.
Finally, for each event, there is a second level list of
arrivals of the seismic signals at instruments resulting
from the event. For each arrival, this includes
information identifying the station, channel, time of
arrival, and considerable information about the station
and its relation to the event.

No actual seismic readings are included in the event
summary files but sufficient information is present to
locate the relevant areas in the raw data files or, as
explained below, in the seismic waveform files.

The seismic waveform files contain actual segments of
seismic data that are associated with arrivals in event
files. Each waveform file entry carries the event number
from the event summary file entry with which the data is
believed to be associated as well as information about
the station, channel, relation between the station and
associated event, and the start time and duration of the
data segment.

The preliminary event summary and seismic waveform files
are derived by SDAC from the on-line arrays whose data is
available in real-time. A later cycle is planned which
will take data from the non-array SROs, the off-line
arrays, and the detections from the short period SRO data
and other sources to produce the final event summary and
final seismic waveform files.

5.  Seismic  Database  Organization 

The nature of the data and its use varies among the
different seismic files, as does their size and
structure. The organization of the seismic data attempts
to enhance the practical access to this database in which
individual files may be as large as 4 billion bits.

Chronological Organization

One uniform characteristic of all of the present seismic
files is their organization into chronological streams.
Each item represents some event or state at a point in
time. Thus, in some sense, a sequence of data of a
particular type represents a segment out of a potentially
infinite stream of that type of data.

These potentially infinite streams must be broken up into
physical files for several reasons. The seismic data will
fill many physical volumes (TBM tapes), and the
Datacomputer requires that physical files reside on a
single volume. The current Datacomputer implementation
also puts a limit of slightly greater than 4 billion bits
on the size of single files. The subdivision of the
seismic data streams into sections allows the
Datacomputer directory to rapidly access the
chronological area of interest and determine the TBM tape
on which the data resides. Sections of the data streams
are more manageable, and direct indices to such sections
can be built more easily. Finally, in the case of some
status and calibration files, each file contains the
state of the instruments at the start of its time period.
This makes reference to earlier files in that stream
unnecessary.

A uniform scheme has been employed, using the
hierarchical directory available in the Datacomputer.
Beneath several directory levels specifying the exact
type of data, the data stream is divided into years, in
most cases, the years subdivided into months, and, in
some cases, the months divided into days.
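
The year, month, and day subdivision maps naturally onto directory
path names. The sketch below only illustrates the idea: the node
names, the dot separator, and the zero-padded numeric form are
hypothetical and are not the names used in the Datacomputer
directory.

```python
from datetime import date


def stream_path(data_type, day, granularity):
    """Build a directory path for one section of a data stream.

    data_type   -- sequence of directory levels naming the type of data
    granularity -- "year", "month", or "day"
    """
    if granularity not in ("year", "month", "day"):
        raise ValueError("granularity must be year, month, or day")
    parts = list(data_type) + [f"{day.year:04d}"]
    if granularity in ("month", "day"):
        parts.append(f"{day.month:02d}")
    if granularity == "day":
        parts.append(f"{day.day:02d}")
    return ".".join(parts)
```

For example, a short period day file and a long period month file
for the same array would share every level down to the month.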

File Groups

The seismic data streams are divided into a sequence of
physical files to facilitate their creation and
manipulation. However, it is sometimes convenient to view
a stream of one type as a single logical entity and
ignore these physical boundaries. This is done with the
file groups feature of the Datacomputer which provides
multiple file data sets.

File groups allow an application to treat a group of
files as a single file, from which retrievals may be done
in the same way as from a normal file. Operations are
available for adding and deleting physical file entries
in a group, and a file may appear in the list of
constituents for any number of groups. So, a file group
is a logical set of files which facilitates retrievals
from a seismic data stream which has been divided into
several physical files.

Some seismic file groups may include thousands of day
files, and it would be impractical to retrieve from that
group if the Datacomputer had to access each of its
constituent files. In order to optimize references to the
physical files, a general facility is available whereby a
logical constraint is placed on the data occurring in
each file in the file group. When a request is issued on
a file group, the Datacomputer can determine which
individual files must be accessed by checking the
retrieval condition of the request against the logical
constraints on the files in the group. For the seismic
files, such constraints normally include limits on the
values of date fields. By using these constraints, a
request on a file group for a one hour interval of data
will actually access only one or two physical files (two
if the hour falls across the boundary between files).
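
The constraint check can be sketched as an interval overlap test.
The representation below is an assumption made for illustration:
each file in the group carries a half-open date constraint, and the
file names are invented.

```python
from datetime import datetime, timedelta


def day_file(day_start):
    """A constituent file constrained to one day of data."""
    return {"name": day_start.strftime("SP-%Y-%m-%d"),
            "start": day_start,
            "end": day_start + timedelta(days=1)}


def files_to_access(group, want_start, want_end):
    """Names of files whose [start, end) constraint overlaps the request."""
    return [f["name"] for f in group
            if f["start"] < want_end and want_start < f["end"]]
```

With a group of a thousand day files, a one hour request selects a
single file, or two when the hour spans midnight, and the remaining
files are never opened.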

6. Communications Considerations

Real-Time Communications

The real-time data from the LASA and NORSAR arrays is
sent through the Seismic Data Analysis Center in
Alexandria, Virginia, before arriving at CCA. At present,
it consists of two Arpanet messages per second, one for
each array (this will change when new protocols are
implemented as explained in the following section). The
real-time processing and transmission elements leading to
CCA provide a short interval (currently less than 20
seconds) of elasticity in the system. A facility must
exist to accept the data at CCA rapidly and continuously.

The Datacomputer system is implemented within a general
purpose time-sharing system with a varying load depending
on time of day and other factors. It cannot guarantee the
required responsiveness. In addition, both the basic
computer system on which it is implemented, a Digital
Equipment Corporation PDP-10 with the TENEX operating
system, and its peripherals including the TBM, require
regular preventative maintenance, and hence cannot be
operated continuously.

To provide the required round-the-clock responsiveness, a
small, dedicated, reliable system known as the Seismic
Input Processor (SIP) has been implemented. The SIP is a
Digital Equipment Corporation PDP-11/40 with two RP-04
(i.e., IBM 3330-like) disks and an Arpanet interface. It
accepts the data stream from the Communication and
Control Processor (CCP) at SDAC, buffers it on its disk,
reformats the data, and periodically transmits it to the
Datacomputer. At the current bandwidth, 26 hours of
buffering are provided per disk pack.

Thus the SIP completely isolates the real-time data
stream from Datacomputer downtime or delay. The SIP also
provides a convenient real-time operator message facility
whereby operations personnel at the Seismic Data Analysis
Center and at CCA can keep each other apprised of system
status.

The real-time nature of these communications also made it
advisable not to use the standard Arpanet host-to-host
protocol. This standard protocol is imposed on top of the
basic message transmission facilities of the Arpanet to
allow hosts to dynamically utilize multiple "virtual"
connections. However, for the Velanet real-time links,
all of the connections are known in advance, and the
standard initial connection and status logic is not
required. Furthermore, the standard protocol provides no
sequence numbering and is constrained to not more than
one message in flight on a given connection if the
connection is to support automatic host level
retransmission on errors. The protocol used on the
seismic links provides its own sequence numbers,
acknowledgements, and checksums. It can allow multiple
messages in flight up to the limit imposed by the basic
Arpanet (eight at present) without loss of other
capabilities.

High Bandwidth

Some problems have been encountered due to the relatively
high bandwidths of data transmitted over the Arpanet in
this project. The SIP periodically bursts its accumulated
data to the Datacomputer, and utilizes very high
bandwidths frequently exceeding the 50 kilobits per
second rate at which the Arpanet communications lines
operate. However, the SIP and the Datacomputer are
physically adjacent, and are locally connected to the
same network node, the CCA Interface Message Processor
(IMP). The data between them need not traverse any
outside communications line.


Some problems have arisen due to congestion in the
Arpanet. To alleviate these problems, the initial Velanet
protocols have been redesigned for greater efficiency.
Formerly, each logical message, be it data, an
acknowledgment, an operator message, or whatever, was
sent as a physical message over the Arpanet. In the new
version of the protocol, several logical messages are
normally packed into physical messages of two types. One
type contains any logical message except acknowledgments
of correct receipt. The second type contains
acknowledgments of the first type of message.




This packing scheme minimizes the Arpanet overhead per
message by maximizing the information per message and
minimizing the number of messages. Furthermore, inside
the Arpanet and invisible to its users, messages, which
can be as large as 8000 bits, are divided into packets of
1008 data bits. By accumulating the data and packing it
into full size messages, full packets are assured, as
well as the minimal packet overhead per data bit.
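
The effect of packing on the number of messages and packets can be
shown with a small calculation. The greedy packing rule and the
1500-bit logical message size used in the example are assumptions;
only the 8000-bit message limit and the 1008-bit packet size come
from the text.

```python
import math

MAX_MESSAGE_BITS = 8000    # largest Arpanet message
PACKET_DATA_BITS = 1008    # data bits carried by one packet


def pack(logical_sizes, limit=MAX_MESSAGE_BITS):
    """Greedily pack logical message sizes (bits) into physical messages."""
    physical, current = [], 0
    for size in logical_sizes:
        if size > limit:
            raise ValueError("logical message exceeds the message limit")
        if current and current + size > limit:
            physical.append(current)    # close the full physical message
            current = 0
        current += size
    if current:
        physical.append(current)
    return physical


def packets_needed(message_bits):
    return math.ceil(message_bits / PACKET_DATA_BITS)
```

Ten logical messages of 1500 bits sent individually need ten
physical messages and twenty packets. Packed, they fit into two
physical messages of 7500 bits, which need sixteen packets in total.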

7. Summary

The Datacomputer provides the Vela Seismological Network
shared access to a large capacity, data management
utility. Careful file design and special communications
protocols were necessitated by the real-time volume of
seismic array data. This approach makes one of the
largest scientific databases in the world available
on-line to processors anywhere on the Arpanet.


Datacomputer Presentation Text

The database services are available without duplication of
effort. The DC is a specialized data management
installation with elaborate facilities for that purpose.

This application of a shared database makes extensive use
of the network facilities of the DC. For those not
familiar with the Arpanet, it is an international network

The

DATACOMPUTER
A Network Data Utility


Computer Corporation of America
575 Technology Square
Cambridge, Mass. 02139
(617) 491-3870

Outline 

• CCA 

• Management  Overview 

• Datacomputer  User’s  View 

• Datacomputer  Architecture 

• Datacomputer  Applications 

• Summary


Computer Corporation of America

• Founded  in  1965 

• Offices  in  Cambridge  and  Washington 

• Specializes  in  Advanced  Database 
Management  and  Communication  Systems 


Research  and  Development 


• Access  Methods 

• Data  Structures 

• User-Interaction  Languages 

• Teleprocessing 

• Natural-Language  Query  Systems 

• Distributed  Database  Management 

• Ultra-Large  On-Line  Storage  Media 

• Resource-Sharing Computer Networks


Software Products

• Model 204 Database Management System

• TDA Computer Message System


DATACOMPUTER

• R&D Project for ARPA

• 70 Man-Year Effort

• Development Completed in December 1976


Management Overview

• What  is  the  Datacomputer? 

• Why  is  the  Datacomputer  of  Interest? 

• What  is  a Datacomputer  Like  Physically? 

• Where are there Datacomputer Installations
  Today?

• What are Some Applications of the
  Datacomputer Today?

• How  Can  One  Use  the  Datacomputer? 


What  Is  the  DATACOMPUTER? 

A Very Large Capacity Centralized Database Management Facility
for Distributed Networks


Why Is the


What Is a
Datacomputer Like Physically?

• Software  Running  on  Standard  DEC 
Equipment 

• Trillion-Bit  Store 

• Interface  to  Network 


DATACOMPUTER
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Where  Are  There 

Datacomputer  Installations  Today? 

• At  CCA 

Arpanet
Ampex  TBM 

• At  Naval  Electronics  Laboratory  Center 

Special Network
No Mass Memory
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What  Are  Some  Applications  of  the 
Datacomputer  Today? 

• Complex  Shared  Scientific  Database 
Seismic  Data  Network 


• Bulk Data Storage and Retrieval
Harvard  University  Computing  Center 
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How  Can  One  Use  the 
Datacomputer? 

• CCA  System  Available  to  Arpanet  Users 

• Datacomputer Software Installed on Any PDP-10 or PDP-20

• Customized  Applications 


Datacomputer  User’s  View 


Programs


DATACOMPUTER 


Terminal Users


Network  Hosts 


DATACOMPUTER


Datalanguage

• Uniform,  High-Level  Interface 


Database  Management 

• Efficient Retrieval and Maintenance Facilities

• Extensive  Data  Description  Facilities 

• Flexible  Access  Controls 


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Datacomputer  Architecture 
Hardware 

• Components 

DEC  System/10  or  DEC  System/20 

Mass  Memory 

Disks 

Interface  Message  Processor 

• Trillion-Bit Store

Economies  of  Mass  Memory  Technology 
Ampex  TBM 


Datacomputer  Architecture 

Software 

• Staging and Interface to Trillion-Bit Store

• Use  of  Inversions 
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Inversions  Data  Proper 


Inversions:

    NAME : MINSK
    TYPE : TANKER
    NAME : INDEPENDENCE
    NAME : SARATOGA
    TYPE : CARRIER
    NAME : ENTERPRISE

Data Proper:

    SARATOGA        CARRIER    USA
    INDEPENDENCE    CARRIER    USA
    MINSK           CARRIER    USSR
    ENTERPRISE      TANKER     UK
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Inversions  Data  Proper 


Inversions:

    NAME : MINSK
    FLAG : UK
    TYPE : TANKER
    NAME : INDEPENDENCE
    FLAG : USSR
    NAME : SARATOGA
    TYPE : CARRIER
    FLAG : USA
    NAME : ENTERPRISE

Data Proper:

    SARATOGA        CARRIER    USA
    INDEPENDENCE    CARRIER    USA
    MINSK           CARRIER    USSR
    ENTERPRISE      TANKER     UK
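
The inversion slides above can be illustrated with a small sketch. It is written in Python for illustration only and does not show how the Datacomputer implements inversions. The records are copied from the "Data Proper" column of the slide, the field names NAME, TYPE and FLAG are taken from the inversion keys, and the helper names build_inversions and lookup are hypothetical.

```python
from collections import defaultdict

# Records as they appear in the slide's "Data Proper" column.
# Field names are inferred from the NAME / TYPE / FLAG inversion keys.
SHIPS = [
    {"NAME": "SARATOGA", "TYPE": "CARRIER", "FLAG": "USA"},
    {"NAME": "INDEPENDENCE", "TYPE": "CARRIER", "FLAG": "USA"},
    {"NAME": "MINSK", "TYPE": "CARRIER", "FLAG": "USSR"},
    {"NAME": "ENTERPRISE", "TYPE": "TANKER", "FLAG": "UK"},
]


def build_inversions(records, fields):
    """Map each (field, value) pair to the positions of the matching records."""
    inversions = defaultdict(list)
    for position, record in enumerate(records):
        for field in fields:
            inversions[(field, record[field])].append(position)
    return dict(inversions)


def lookup(inversions, records, **criteria):
    """Return the records that satisfy every field=value criterion.

    Only the inversion entries are consulted to decide which records match;
    the data proper is read just to fetch the final result.
    """
    position_sets = [
        set(inversions.get((field, value), []))
        for field, value in criteria.items()
    ]
    if not position_sets:
        return list(records)
    matches = set.intersection(*position_sets)
    return [records[i] for i in sorted(matches)]


INVERSIONS = build_inversions(SHIPS, ["NAME", "TYPE", "FLAG"])
usa_carriers = lookup(INVERSIONS, SHIPS, TYPE="CARRIER", FLAG="USA")
print([ship["NAME"] for ship in usa_carriers])  # ['SARATOGA', 'INDEPENDENCE']
```

The point of the sketch is that a request such as "carriers under the USA flag" is answered by intersecting two short inversion lists, so the data proper is touched only for the records that actually qualify.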


Datacomputer  Applications 

• Bulk  Data  Storage  and  Retrieval 

• Command  and  Control 

• Shared  Scientific  Database 


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Seismic  Data  Network 


Summary 

• CCA’s Datacomputer: Developed for ARPA

• Over  100  Users  on  the  Arpanet 

• Very  Large  Capacity,  Very  Economical 
On-Line  Storage 

• Convenient  Way  of  Sharing  Data 

• Easily  Arranged  Use  of  Service  Over  Arpanet 

• Software  Installed  on  Standard  DEC  Equipment 

• CCA Ready to Help Application Development